Abstract

In this paper we propose an approach that uses expectation maximization
(EM) clustering to find the climate regions and a support vector machine to
build a predictive model for each of these regions. To minimize bias in the
estimates, ten-fold cross-validation is adopted both for obtaining the
clusters and for building the predictive models. The EM clustering could
identify all the zones of the Köppen classification over the Indian region.
When employed for predicting temperature, the proposed strategy resulted in
an RMSE of 1.19 in the Montane climate region and 0.89 in the Humid Sub
Tropical region, compared to 2.9 and 0.95, respectively, obtained using the
k-means and linear regression method.

Keywords: support vector machine, expectation maximization, k-means,
regression, climate regions, climate change, Köppen classification


1. Introduction 

Regionalization techniques are found to be effective in improving the
prediction accuracy of climate models. Building regional models and
predicting climate variability require processing and extracting
information from large volumes of high-dimensional data. Data mining methods
such as k-means (KM) clustering and statistical methods such as linear
regression (LR) are commonly employed to group the data into regions of
similar climate and to build a model that predicts the climate variables for
subsequent years. The k-means method requires specifying k initial cluster
centers, which are generally not known a priori. Also, the procedure is
sensitive to the selection of the initial cluster centers. Moreover, a
linear regression model may not capture the non-linear relationships among
the climate variables.

EM finds clusters by fitting a mixture of Gaussians to the given data set.
Each of the Gaussians is associated with a mean and a covariance matrix.
The prior probability for each Gaussian is computed as the fraction of
points in the cluster defined by that Gaussian. The values of the means and
variances are updated iteratively until the estimates converge.
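The EM updates described above can be sketched in a few lines. The example below is a minimal illustration for one-dimensional data, so each Gaussian carries a scalar variance instead of a covariance matrix. The function names (`gaussian_pdf`, `em_gmm_1d`), the synthetic data, and the starting values are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, priors, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM from the given starting parameters."""
    means, variances, priors = list(means), list(variances), list(priors)
    k, n = len(means), len(data)
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point.
        resp = []
        for x in data:
            weighted = [priors[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: re-estimate prior, mean, and variance of each Gaussian.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            priors[c] = n_c / n  # soft fraction of points in the cluster
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # floor keeps a component from collapsing
    return means, variances, priors


# Hypothetical data: two well-separated groups centred near 0 and 10.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
means, variances, priors = em_gmm_1d(data, [1.0, 8.0], [4.0, 4.0], [0.5, 0.5])
print(means, variances, priors)
```

Unlike k-means, each point contributes to every cluster in proportion to its responsibility, which is why the priors are soft fractions and not hard counts.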

In this paper we propose an approach that uses expectation maximization
(EM) clustering to find the climate regions and a support vector machine to
build a predictive model for each of these regions. To minimize bias in the
estimates, ten-fold cross-validation is adopted both for obtaining the
clusters and for building the predictive models.

The following are the main objectives of the present work:

1. Understand the process of climate change over the Indian region through
the development of information extraction techniques that can effectively
predict the climate variability

2. Develop a methodology for processing the long term gridded climate
data and obtaining climate regions using expectation maximization
clustering

3. Prepare the maps of the climate regions identified by expectation
maximization clustering and compare them with the standard climate zones as
per the Köppen classification over the Indian region

4. Evolve a procedure to subset the long term climate dataset into regional
data sets

5. Develop methods to extract the training data set for building the support
vector regression model based on the number of years to predict

6. Obtain the validation data set for each of the grid locations and
compute the root mean squared error

7. Compare the performance of the proposed methodology with the k-means
and linear regression procedure

This paper is organized as follows. In Section 2 the proposed methodology
for predicting climate variables is presented. The experiments and results
are discussed in Section 3. Conclusions and discussion are deferred to
Section 4.

2. Methodology 

In the proposed methodology the climate dataset is first regionalized by
applying expectation maximization clustering to the long term averages
of the climate variables. A predictive model is then developed for each
region using support vector machine (SVM) regression. Ten-fold
cross-validation is employed to obtain robust estimates of the root mean
square error (RMSE).
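The ten-fold cross-validation used to estimate the RMSE can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are contiguous and unshuffled, the helper names (`k_fold_indices`, `fit_line`, `cross_validated_rmse`) are hypothetical, and a least-squares line stands in for the SVM regression model, which is not reproduced here.

```python
import math


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    start = 0
    for fold in range(k):
        # The first n_samples % k folds get one extra sample.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def fit_line(xs, ys):
    """Stand-in model (NOT the paper's SVM): a least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


def cross_validated_rmse(xs, ys, fit, k=10):
    """Average the held-out RMSE over k folds for any fit(xs, ys) -> predictor."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_err = [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / len(scores)


# Hypothetical noise-free data, so the held-out error should be close to 0.
xs = [float(x) for x in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
print(cross_validated_rmse(xs, ys, fit_line))
```

Because every sample is held out exactly once, the averaged RMSE is less dependent on one particular train/test split than a single hold-out estimate.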





Figure 1: A flow chart depicting the procedure for building a predictive model 

The procedure employed in developing a predictive model is shown in
Figure 1.

Algorithm 1 describes the steps implemented in the present paper for
obtaining a model for predicting climate variables.

3. Experiments and Results 

NCEP/NCAR reanalysis data for the 65 years from 1948 to 2012, comprising
the climate variables Atmospheric Pressure, Relative Humidity, Precipitable
Water, Zonal Wind, Meridional Wind, Precipitation, and Air Temperature, are
used in the analysis.

Algorithm 1 Procedure for Predicting Climate Variables

Require: 1. Climate data set
         2. Clustering method
         3. Number of years to predict (p)
         4. Variable to be predicted

Ensure:  1. Correlation coefficient
         2. Root mean squared error

Algorithm

1. Extract the long term mean of the climate variables for each
   2.5° x 2.5° grid over the Indian region

2. Apply the clustering method to obtain regions R_1, R_2, ..., R_n

3. Build the model for the variable to be predicted
   (a) For each region in R_1, R_2, ..., R_n
       i. obtain the mean of the climate variables for all the grid points
          in the cluster for j-p years, where j denotes the total number of
          years and p denotes the number of years for which prediction is
          required
       ii. build a support vector machine regression model using a ten-fold
           cross-validation procedure

4. Test the model built in Step 3
   (a) For each cluster in R_1, R_2, ..., R_n
       i. For each grid point in the cluster
          A. apply the corresponding model to predict precipitation and
             temperature for years 1, ..., p
          B. compute the RMSE using the predicted values and the
             actual values of the climate variables

5. RETURN RMSE.

6. END.
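Steps 3 and 4 of Algorithm 1 can be sketched on a toy region. This is a minimal illustration, assuming that the p years to be predicted are the last p years of each series. The helper names and the data are hypothetical, and repeating the last training value stands in for the SVM regression model.

```python
import math


def region_mean_series(grid_series):
    """Average the yearly values over all grid points of one region."""
    n_points = len(grid_series)
    return [sum(year_values) / n_points for year_values in zip(*grid_series)]


def split_by_years(series, p):
    """Return (first j - p years, last p years) of a yearly series of length j."""
    if not 0 < p < len(series):
        raise ValueError("p must satisfy 0 < p < number of years")
    return series[:-p], series[-p:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual))


# Hypothetical region with two grid points and five years of one variable.
region = [[1.0, 2.0, 3.0, 4.0, 5.0],
          [3.0, 4.0, 5.0, 6.0, 7.0]]
p = 1
train, _ = split_by_years(region_mean_series(region), p)

# Stand-in predictor (NOT the paper's SVM): repeat the last training value.
predicted = [train[-1]] * p

# Step 4: score the regional model at each grid point on the held-out years.
for point in region:
    _, actual = split_by_years(point, p)
    print(rmse(actual, predicted))
```

The model is trained once per region on the regional mean but scored separately at every grid point, so grid points that differ from their regional mean contribute larger errors.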
Table 1: EM Cluster Centroids of different Climate Zones

Climate Variable    Montanenew  Semi Arid  Tropical Wet and Dry  Arid     Montane  Tropical Wet  Humid Sub Tropical
Air Temperature     12.44       27.04      25.79                 25.81    -2.54    26.83         24.8
Precipitable water  18.86       41.32      29.19                 22.02    6.31     38.01         37.61
Precipitation       4.8         3.1        2.57                  0.67     2.36     3.03          6.4
Relative Humidity   81.96       76.73      53.05                 35.19    78.81    75.51         74.5
Sea Level Pressure  1011.07     1008.87    1008.08               1007.8   1015.22  1009.76       1009.43
Zonal winds         0.69        0.94       0.57                  1.1      2.99     2.67          0.69
Meridional winds    1.02        1.63       -0.36                 0.88     1.82     -1.01         0.54


The application of EM clustering to the dataset resulted in 7 climate
regions, whereas the Köppen classification has only six. The present
algorithm has brought out a new region, consisting of Uttaranchal, Sikkim
and Arunachal Pradesh, out of the existing Montane climate region. We
attribute this to climate change; further investigations are required to
ascertain these findings.

The cluster centroids for the seven regions are shown in Table 1. The
air temperature in the Montanenew region is very high compared to the
Montane region; the reasons for this are under investigation.



Figure 2: Climate Regions Obtained using Expectation Maximization Clustering
Procedure

The spatial extents of the climate regions obtained from the proposed
Algorithm 1 are shown in Figure 2.

Table 2: RMSE for different climate zones

Region                EM+SVM  KM+LR
Montanenew            1.19    2.9
Semi Arid             0.97    0.88
Tropical Wet and Dry  3.42    2.94
Arid                  0.68    0.74
Montane               1.93    1.69
Tropical Wet          0.55    0.48
Humid Sub Tropical    0.8     0.95


The RMSE values for predicting the temperature for the year 2012 are given
in Table 2.
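The values in Table 2 can be compared region by region with a short script. The sketch below copies the numbers from the table; the dictionary layout and the helper name `lower_rmse_regions` are illustrative choices, and a lower RMSE is read as the better result.

```python
# (EM+SVM, KM+LR) RMSE values copied from Table 2.
rmse_table = {
    "Montanenew": (1.19, 2.9),
    "Semi Arid": (0.97, 0.88),
    "Tropical Wet and Dry": (3.42, 2.94),
    "Arid": (0.68, 0.74),
    "Montane": (1.93, 1.69),
    "Tropical Wet": (0.55, 0.48),
    "Humid Sub Tropical": (0.8, 0.95),
}


def lower_rmse_regions(table):
    """Regions where the first method's RMSE is lower than the second's."""
    return [region for region, (first, second) in table.items() if first < second]


# A negative difference means EM+SVM has the lower error in that region.
for region, (em_svm, km_lr) in rmse_table.items():
    print(f"{region:22s} EM+SVM - KM+LR = {em_svm - km_lr:+.2f}")
print(lower_rmse_regions(rmse_table))
```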



Figure 3: Predicted temperature for the year 2012 over Indian region obtained using
Expectation Maximization and SVM Regression Procedure

Figure 4: Absolute error in predicting the temperature for the year 2012 over Indian region

The spatial maps of the predicted temperature and the absolute error for
the year 2012 over the Indian region are shown in Figures 3 and 4.

The EM clustering could identify all the zones of the Köppen
classification over the Indian region. When employed for predicting
temperature, the proposed strategy resulted in an RMSE of 1.19 in the
Montane climate region and 0.89 in the Humid Sub Tropical region, compared
to 2.9 and 0.95, respectively, obtained using the k-means and linear
regression method.

4. Conclusions and Discussion 

The expectation maximization clustering could identify the different
climate zones of the Köppen classification over the Indian region. It is
observed that the regions of Uttaranchal, Sikkim and Arunachal Pradesh have
been identified by EM as a separate group, distinct from the Montane climate
zone of the Köppen classification. This needs further investigation.

EM clustering and SVM performed better than k-means and linear regression
only in the Humid Subtropical and Montane climate zones. It is observed
that EM performance degrades as the dimensionality of the data set
increases, owing to numerical precision problems.

The fast-growing volume of climate datasets and their high dimensionality
require the development of novel methods for preprocessing and information
extraction. Our future work will focus on developing techniques for big
data climate analytics.